feat: DashboardHygieneAnalyzer (broken panels)#23

Merged
cicdteam merged 21 commits into main from dashboard-hygiene
May 23, 2026
@cicdteam

Summary

Adds the last v0.1 analyzer: flags Grafana dashboards whose panel queries reference Prometheus metrics that do not exist (not in head series, not in recording-rule outputs). One finding per (dashboard, missing-metric) pair, severity Medium.

Scope narrowed from spec §6.3: ships only broken-panel detection in v0.1. Untouched-dashboard detection (weak proxy without Grafana Enterprise meta.viewedAt) and near-duplicate detection (no canonical "panel signature" definition) are deferred per .claude/docs/superpowers/specs/2026-05-23-dashboard-sprawl-analyzer-design.md §11.

What's new

  • Analyzer internal/analyzers/dashboardhygiene/ with happy-path detection, recording-rule resolution, VM-without-vmalert graceful-degrade, silent-skip for template-variable + non-prom datasources, fix-snippet builder.
  • CLI remetric dashboards broken --prometheus <URL> --grafana <URL> (both flags required). Honors all standard flags including new --ignore-dashboard <regex>.
  • Wired into scan and report runner slices (no-Grafana → warning, parallel to unusedmetrics).
  • Types:
    • Finding.Dashboard string field with omitempty.
    • ignore.Patterns.Dashboard + matching --ignore-dashboard flag.
    • Renamed ClassDashboardSprawl → ClassBrokenPanel, CategoryDashboardSprawl → CategoryDashboardHygiene (no live emitters of the old names).
  • Grafana client additive extensions: Dashboard.PanelTargets() (flat panel-title + expr pairs) and Client.BaseURL() (defensive copy).
  • promqlx fix: isSentinel switched from equality to substring containment. Catches concatenations like ${metric}_total → __remetric_var___total that previously leaked into the extracted metric set, polluting findings in both dashboardhygiene and unusedmetrics.
  • Docs docs/findings/broken-panel.md replaces the dashboard-sprawl.md placeholder; mkdocs nav + cross-link in unused-metric.md + README + --help text all updated.
  • E2E e2e/dashboards_e2e_test.go provisions a broken-panel dashboard via file-based Grafana provisioning, asserts the finding.

Test plan

  • go test ./... -count=1 -race (20 packages, all PASS)
  • make fmt vet lint vuln (0 issues, no vulnerabilities)
  • make cover (total 86.1%, dashboardhygiene 85.1% — exceeds 75% floor + 80% target)
  • make e2e (all 8 e2e tests PASS including new TestE2E_DashboardsBroken_JSON)
  • CI green on this PR before merge

Commits

21 commits with per-task two-stage review (spec compliance → code quality). Each commit + fixup is independently buildable and tested. Squash-friendly history; bisect-friendly if anything regresses later.

cicdteam added 21 commits May 23, 2026 01:02
…panel/dashboard-hygiene

Scope narrows in v0.1 to only broken-panel detection; the old
names had no live emitters.

Used by DashboardHygieneAnalyzer to carry the dashboard title.
omitempty keeps the wire form clean for non-dashboard findings.
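
To illustrate the omitempty behaviour described above, here is a minimal sketch. The Finding struct is a trimmed-down, hypothetical stand-in (the real type has more fields, and the JSON key names are assumptions); only the Dashboard field with omitempty comes from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Finding is a hypothetical, reduced stand-in for the project's finding type.
// Only the Dashboard field and its omitempty tag are taken from the PR text.
type Finding struct {
	Class     string `json:"class"`
	Metric    string `json:"metric"`
	Dashboard string `json:"dashboard,omitempty"`
}

// encode marshals a finding into its JSON wire form.
func encode(f Finding) string {
	b, err := json.Marshal(f)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// A non-dashboard finding: the "dashboard" key is omitted entirely.
	fmt.Println(encode(Finding{Class: "unused-metric", Metric: "foo_total"}))
	// A dashboard finding: the title is carried on the wire.
	fmt.Println(encode(Finding{Class: "broken-panel", Metric: "bar_total", Dashboard: "API latency"}))
}
```

The first line printed has no "dashboard" key at all, which is what keeps existing consumers of non-dashboard findings unaffected.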
Anchored regex against Finding.Dashboard. Empty field never matches.
Wires through config.IgnoreConfig.Dashboard.
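
A rough sketch of the matching rule described above (anchored regex, empty field never matches). The helper names and the way anchoring is applied are assumptions for illustration, not the actual ignore package code.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileAnchored wraps the user-supplied pattern so that it must match the
// whole dashboard title. How the real code anchors patterns is an assumption.
func compileAnchored(pattern string) (*regexp.Regexp, error) {
	return regexp.Compile(`^(?:` + pattern + `)$`)
}

// ignoreDashboard reports whether a finding with the given dashboard title
// should be dropped. An empty title never matches, even for ".*", so
// findings without a dashboard are not swallowed by a broad pattern.
func ignoreDashboard(re *regexp.Regexp, title string) bool {
	if re == nil || title == "" {
		return false
	}
	return re.MatchString(title)
}

func main() {
	re, err := compileAnchored("staging-.*")
	if err != nil {
		panic(err)
	}
	fmt.Println(ignoreDashboard(re, "staging-api"))    // whole title matches
	fmt.Println(ignoreDashboard(re, "my-staging-api")) // anchoring rejects a partial match
	fmt.Println(ignoreDashboard(re, ""))               // empty field never matches
}
```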
PanelTargets walks rows recursively, returns (panel-title, expr)
pairs filtered to Prometheus targets. BaseURL is a defensive copy
used by the dashboard-hygiene analyzer to build absolute dashboard
URLs in Fix snippets.
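
The recursive walk can be sketched roughly as follows. All types here are simplified, hypothetical stand-ins for the Grafana dashboard model, and skipping targets with no resolvable datasource is an assumption; the real PanelTargets may treat default datasources differently.

```go
package main

import "fmt"

// Hypothetical, simplified dashboard model for illustration only.
type Datasource struct{ Type string }

type Target struct {
	Expr       string
	Datasource *Datasource // optional per-target override
}

type Panel struct {
	Title      string
	Datasource *Datasource
	Targets    []Target
	Panels     []Panel // panels nested inside a row
}

type PanelTarget struct{ PanelTitle, Expr string }

// isPrometheus resolves the target's datasource (falling back to the panel's)
// and reports whether it is a Prometheus datasource.
func isPrometheus(p Panel, t Target) bool {
	ds := t.Datasource
	if ds == nil {
		ds = p.Datasource
	}
	return ds != nil && ds.Type == "prometheus"
}

// panelTargets flattens panels and nested rows into (panel-title, expr)
// pairs, keeping only Prometheus targets with a non-empty expression.
func panelTargets(panels []Panel) []PanelTarget {
	var out []PanelTarget
	for _, p := range panels {
		for _, t := range p.Targets {
			if t.Expr != "" && isPrometheus(p, t) {
				out = append(out, PanelTarget{PanelTitle: p.Title, Expr: t.Expr})
			}
		}
		out = append(out, panelTargets(p.Panels)...) // descend into rows
	}
	return out
}

func main() {
	prom := &Datasource{Type: "prometheus"}
	loki := &Datasource{Type: "loki"}
	dash := []Panel{
		{Title: "CPU", Datasource: prom, Targets: []Target{{Expr: "rate(cpu_seconds_total[5m])"}}},
		{Title: "Row", Panels: []Panel{
			{Title: "Logs", Datasource: loki, Targets: []Target{{Expr: `{app="api"}`}}},
			{Title: "Errors", Datasource: prom, Targets: []Target{{Expr: "sum(http_errors_total)"}}},
		}},
	}
	for _, pt := range panelTargets(dash) {
		fmt.Printf("%s: %s\n", pt.PanelTitle, pt.Expr)
	}
}
```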
New analyzer flags Grafana dashboards whose panel queries reference
missing Prometheus metrics. Skeleton + nil-Graf warning path; full
algorithm in subsequent commits.

Walks every Grafana dashboard, parses Prometheus targets via
promqlx, and groups (dashboard, missing-metric) pairs. Severity
Medium per the design. Recording-rule outputs and fix snippet
come in later commits.

…rename

Code-quality follow-ups to the happy-path commit:
- add Dashboard tiebreaker to the comparator so output is
  deterministic across map iterations
- extract the comparator into findingLess to keep Analyze
  under the gocyclo limit (was 15, the extra branch pushed
  it to 16)
- extract sample-cap 5 to a named const for grep-ability
- rename buildFinding parameter to mirror the call-site
  variable name
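
A minimal sketch of what an extracted comparator could look like, using the ordering this PR states elsewhere (severity desc, sample-count desc, dashboard asc, metric asc). The Finding fields and the integer severity encoding are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Finding is a hypothetical, reduced shape; Severity as an int where a
// larger value means more severe is an assumption.
type Finding struct {
	Severity  int
	Samples   int // number of sample panels attached to the finding (assumed meaning)
	Dashboard string
	Metric    string
}

// findingLess orders by severity desc, sample count desc, dashboard asc,
// metric asc. The dashboard and metric tiebreakers make the output
// independent of map iteration order.
func findingLess(a, b Finding) bool {
	if a.Severity != b.Severity {
		return a.Severity > b.Severity
	}
	if a.Samples != b.Samples {
		return a.Samples > b.Samples
	}
	if a.Dashboard != b.Dashboard {
		return a.Dashboard < b.Dashboard
	}
	return a.Metric < b.Metric
}

// sortFindings sorts in place using findingLess.
func sortFindings(fs []Finding) {
	sort.Slice(fs, func(i, j int) bool { return findingLess(fs[i], fs[j]) })
}

func main() {
	fs := []Finding{
		{Severity: 2, Samples: 1, Dashboard: "b", Metric: "x"},
		{Severity: 2, Samples: 3, Dashboard: "z", Metric: "x"},
		{Severity: 2, Samples: 1, Dashboard: "a", Metric: "y"},
		{Severity: 2, Samples: 1, Dashboard: "a", Metric: "x"},
	}
	sortFindings(fs)
	for _, f := range fs {
		fmt.Printf("%d %d %s %s\n", f.Severity, f.Samples, f.Dashboard, f.Metric)
	}
}
```

Keeping the comparison in its own function also keeps the branch count out of the caller, which is the gocyclo concern mentioned above.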
Same missing metric across multiple panels yields one finding;
distinct missing metrics in the same dashboard emit separate
findings.
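
The grouping rule above could be sketched like this. The Ref and pairKey types and the groupMissing helper are hypothetical names; the sketch only shows how keying on (dashboard, metric) collapses repeated references across panels.

```go
package main

import "fmt"

// Ref is one metric reference extracted from a panel query (hypothetical type).
type Ref struct{ Dashboard, Panel, Metric string }

// pairKey identifies one finding: a (dashboard, missing-metric) pair.
type pairKey struct{ Dashboard, Metric string }

// groupMissing drops references to known metrics and groups the rest by
// (dashboard, metric), collecting the panels that reference each pair.
func groupMissing(refs []Ref, known map[string]bool) map[pairKey][]string {
	out := make(map[pairKey][]string)
	for _, r := range refs {
		if known[r.Metric] {
			continue
		}
		k := pairKey{Dashboard: r.Dashboard, Metric: r.Metric}
		out[k] = append(out[k], r.Panel)
	}
	return out
}

func main() {
	refs := []Ref{
		{"API", "Latency", "foo_total"},
		{"API", "Errors", "foo_total"}, // same missing metric, second panel
		{"API", "Errors", "bar_total"}, // distinct missing metric, same dashboard
		{"DB", "Conns", "up"},          // known metric, no finding
	}
	groups := groupMissing(refs, map[string]bool{"up": true})
	fmt.Println(len(groups))                                // number of findings
	fmt.Println(groups[pairKey{"API", "foo_total"}]) // panels behind one finding
}
```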
A recording rule whose output is not yet in head series must
still be treated as a known metric. Mirrors the resolution flow
in unusedmetrics, including the VictoriaMetrics graceful-degrade
sentinel.

… test

Code-quality follow-ups to the recording-rule resolution commit:
- expand the two BuildInfo-adjacent comments to explain WHY each
  VM-flavor check exists (404 path vs 200-empty-groups path), so a
  future reader doesn't need to cross-reference unusedmetrics
- strengthen the RR test by querying AlertA from a second panel:
  proves the type filter (r.Type == "recording") actually held,
  not just that the recording-rule output was added to exists
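
As a sketch of the resolution step and the type filter discussed above: the known-metric set is the union of head series and recording-rule outputs, and alerting rules must not contribute. The Rule type and knownMetrics helper are hypothetical; only the r.Type == "recording" check and the AlertA example name come from the commit message.

```go
package main

import "fmt"

// Rule is a reduced, hypothetical view of a rule as returned by a
// Prometheus-compatible rules API ("recording" or "alerting").
type Rule struct {
	Name string
	Type string
}

// knownMetrics builds the set of metric names that count as existing:
// everything in head series plus the outputs of recording rules.
// Alerting rule names are deliberately excluded by the type filter.
func knownMetrics(headSeries []string, rules []Rule) map[string]struct{} {
	exists := make(map[string]struct{}, len(headSeries)+len(rules))
	for _, m := range headSeries {
		exists[m] = struct{}{}
	}
	for _, r := range rules {
		if r.Type == "recording" {
			exists[r.Name] = struct{}{}
		}
	}
	return exists
}

// isMissing reports whether a referenced metric is absent from the set.
func isMissing(exists map[string]struct{}, metric string) bool {
	_, ok := exists[metric]
	return !ok
}

func main() {
	exists := knownMetrics(
		[]string{"up"},
		[]Rule{
			{Name: "job:up:sum", Type: "recording"}, // not yet in head series
			{Name: "AlertA", Type: "alerting"},
		},
	)
	fmt.Println(isMissing(exists, "job:up:sum")) // recording-rule output counts as known
	fmt.Println(isMissing(exists, "AlertA"))     // alert names are not metrics
}
```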
Grafana template variables like ${metric}_total sanitise to
__remetric_var___total - a valid PromQL identifier that parses
cleanly and leaks into the extracted metric set. Change isSentinel
to a Contains check so any name containing the sentinel substring
is treated as a sanitiser artifact and filtered.

Without this, dashboardhygiene and unusedmetrics would treat
template-variable expressions as references to bogus metrics
named '__remetric_var__*'.
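
The difference between the two checks can be shown with a small sketch. The sentinel value is inferred from the example in this commit message; the sanitise helper and its regular expression are assumptions that only cover the ${name} variable form, and isSentinelEq stands in for the previous equality-based check.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sentinel is the placeholder substituted for template variables, inferred
// from "${metric}_total sanitises to __remetric_var___total".
const sentinel = "__remetric_var__"

// varRef matches only the ${name} form; the real sanitiser may handle more.
var varRef = regexp.MustCompile(`\$\{[^}]+\}`)

// sanitise replaces template variables so the expression parses as PromQL.
func sanitise(expr string) string {
	return varRef.ReplaceAllLiteralString(expr, sentinel)
}

// isSentinelEq is the old behaviour: only the bare placeholder is filtered.
func isSentinelEq(name string) bool { return name == sentinel }

// isSentinel is the new behaviour: any name containing the placeholder is
// treated as a sanitiser artifact.
func isSentinel(name string) bool { return strings.Contains(name, sentinel) }

func main() {
	name := sanitise("${metric}_total")
	fmt.Println(name)               // the concatenated identifier
	fmt.Println(isSentinelEq(name)) // equality misses it
	fmt.Println(isSentinel(name))   // containment catches it
	fmt.Println(isSentinel("http_requests_total"))
}
```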
…and Loki targets

Grafana template-variable queries (${metric}_total) and non-Prometheus
datasources must not generate findings or warnings. The
template-variable path relies on promqlx filtering sentinel-derived
metric names; the Loki path relies on PanelTargets filtering by
datasource type.

Per-dashboard fetch errors degrade to warnings without aborting
the analyzer. Search() failure is fatal. VictoriaMetrics without
--vmalert emits the recording-rules-unavailable warning.
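
The fatal-versus-warning split might look roughly like this. The grafanaAPI interface, the collect function, and the fake client are all hypothetical; the real client's method set and warning text are not shown in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// grafanaAPI is a hypothetical interface used only for this sketch.
type grafanaAPI interface {
	Search() ([]string, error)            // dashboard UIDs
	Dashboard(uid string) (string, error) // dashboard title
}

// collect lists dashboards and fetches each one. A Search failure aborts the
// run; a failed per-dashboard fetch becomes a warning and the loop continues.
func collect(c grafanaAPI) ([]string, []string, error) {
	uids, err := c.Search()
	if err != nil {
		return nil, nil, fmt.Errorf("grafana search: %w", err)
	}
	var titles, warnings []string
	for _, uid := range uids {
		title, err := c.Dashboard(uid)
		if err != nil {
			warnings = append(warnings, fmt.Sprintf("dashboard %s skipped: %v", uid, err))
			continue
		}
		titles = append(titles, title)
	}
	return titles, warnings, nil
}

// fakeGrafana is an in-memory test double.
type fakeGrafana struct {
	uids        []string
	searchFails bool
	broken      map[string]bool
}

func (f fakeGrafana) Search() ([]string, error) {
	if f.searchFails {
		return nil, errors.New("search failed")
	}
	return f.uids, nil
}

func (f fakeGrafana) Dashboard(uid string) (string, error) {
	if f.broken[uid] {
		return "", errors.New("fetch failed")
	}
	return "title-" + uid, nil
}

func main() {
	c := fakeGrafana{uids: []string{"a", "b"}, broken: map[string]bool{"b": true}}
	titles, warnings, err := collect(c)
	fmt.Println(titles, warnings, err)
}
```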
Renders a paste-ready instruction block: restore the metric or
remove the broken queries. Drops the URL line when no absolute
dashboard URL is available. Caps the panel list at 10 entries
with a '... and N more' tail.
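
A minimal sketch of the capping and URL-dropping behaviour described above. The wording of the snippet, the function names, and the constant name are invented for illustration; only the cap of 10, the '... and N more' tail, and the optional URL line come from the commit message. It uses the min() builtin, so it assumes Go 1.21 or newer.

```go
package main

import (
	"fmt"
	"strings"
)

// maxListedPanels caps how many panel titles the snippet lists.
const maxListedPanels = 10

// panelLines renders up to maxListedPanels titles and, if any were cut,
// a trailing "... and N more" line.
func panelLines(panels []string) []string {
	n := min(len(panels), maxListedPanels)
	lines := make([]string, 0, n+1)
	for _, p := range panels[:n] {
		lines = append(lines, "  - "+p)
	}
	if rest := len(panels) - n; rest > 0 {
		lines = append(lines, fmt.Sprintf("  ... and %d more", rest))
	}
	return lines
}

// fixSnippet builds the instruction block. The Dashboard line is omitted
// when no absolute URL is available.
func fixSnippet(metric, dashboardURL string, panels []string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Restore metric %q or remove the queries that reference it.\n", metric)
	if dashboardURL != "" {
		fmt.Fprintf(&b, "Dashboard: %s\n", dashboardURL)
	}
	b.WriteString("Panels:\n")
	b.WriteString(strings.Join(panelLines(panels), "\n"))
	b.WriteString("\n")
	return b.String()
}

func main() {
	panels := make([]string, 12)
	for i := range panels {
		panels[i] = fmt.Sprintf("Panel %d", i+1)
	}
	fmt.Print(fixSnippet("foo_total", "", panels))
}
```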
… builtin

Code-quality follow-ups to the fix-snippet commit:
- pull the broken-panel docs URL from findings.DocURL(ClassBrokenPanel)
  instead of a hardcoded literal, so the single source of truth in
  internal/findings/ stays authoritative
- replace explicit limit-clamping with the Go 1.21+ min() builtin
- replace C-style index loop with idiomatic range over the slice

New top-level subject 'dashboards' with one action 'broken'.
Requires --prometheus and --grafana. Honors --output, --min-severity,
--ignore-dashboard, --ignore-metric, --fail-on, --limit, --timeout.

…pty.go

Code-quality follow-ups to the dashboards broken subcommand:
- drop the CLI's local re-sort; the analyzer already orders by
  (severity desc, sample-count desc, dashboard asc, metric asc),
  and the filter passes are stable. The local sort was discarding
  the sample-count tiebreaker - meaningful signal that broken-from-
  many-panels metrics rank higher within a severity tier
- move brokenPanelCopy to empty.go for parity with cardinalityCopy
  and labelPatternCopy
- extend TestEmptyCopy_Values to cover brokenPanelCopy and the
  previously-uncovered unusedMetricsCopy

Both flows include the new analyzer; without --grafana it emits
a warning and zero findings (consistent with unusedmetrics).

…l page

Real content for the broken-panel finding class. Updates the
catalog, the unused-metric cross-link, the mkdocs nav, and the
'What's still missing in v0.1' README section since the analyzer
now ships in v0.1.

Provisions a Grafana dashboard whose only panel queries a metric
Prometheus does not scrape; runs 'remetric dashboards broken' and
asserts the finding is emitted with class=broken-panel.

…text

Follow-up to the analyzer landing. Updates user-facing copy that
still listed the v0.1 analyzer set as four (now five with
broken-panel) and omitted --ignore-dashboard from the ignore-*
table.
@cicdteam cicdteam merged commit 717b7ad into main May 23, 2026
4 checks passed
@cicdteam cicdteam deleted the dashboard-hygiene branch May 23, 2026 07:51